Assembly of metabarcode reads
Create Contigs from Paired-end Reads
We will begin by merging our reads into contigs, followed by filtering and trimming of reads based on quality score and several other metrics. In this experiment, paired-end sequencing of either the V5-V6 region of the 16S rRNA gene or the ITS2 (Internal transcribed spacer) region between the 5.8S and 28S rDNA of the fungal ribosome genes, was performed. These regions are around 316 bp and 210 bp in length, respectively.
Our reads are between 200 and 250 bp in length, therefore this results in a significant overlap between the forward and reverse reads in each pair. We will combine these pairs of reads to produce contigs.
Note, we will just focus on the 16S rRNA metabarcoding reads but it is important to remember the difference between different regions that are amplified using metabarcoding.
This will be achieved by using the Make.contigs tool from the Mothur toolsuite. Make.contigs will look at each pair, take the reverse complement reverse read, and then determine the overlap between the two sequences. Where an overlapping base call differs between the two reads, the quality score is used to determine the consensus base call. A new quality score is derived by combining the two original quality scores in both of the reads for all the overlapping positions.
To run Make.contigs we can use the default settings, but ensure you select to output the log file.
Data Cleaning
Quality is always a consideration, therefore our next step is to improve the quality of our data. Before doing so, we want to get an overview of our data, similarly to using FastQC. To get a summary of the contigs generated in the previous step, we use Summary.seqs.
Input for this step is the trim.contigs.fasta file from the Make.contigs step. We also want to select Output logfile, as we will use this to get a summary of our data.
As before, the output from Make.contigs has given us long, difficult to decipher names, therefore it would be good to rename the files. This will make it easier to keep track of our data. It is good practise to do this periodically while working through the different data cleaning steps.
The above summary result is from B_sample_98, which is a bacterial detection sample. Going through the results in the logfile, the key things we can gather are as follows:
- Almost all the contigs are between 204 and 339 bases long (
Nbasescolumn,Minimumand97.5%-tilerows) - 97.5% of our reads had no ambiguous base calls (
Ambigscolumn) - The longest contig in the dataset is 452 bases (
Nbasescolumn,Maximumrow) - There are 56,573 contigs in our dataset
Our region of interest for this sample is V5-V6 region of the 16S rRNA gene, which is only 316 bases long. Any reads significantly longer than this expected value likely did not assemble well in the Make.contigs step. Furthermore, we see that 2.5% of our reads have at most 21 ambiguous base calls (Ambigs column). In the next steps we will clean up our data by removing these problematic reads.
We do this data cleaning using the Screen.seqs tool, which removes:
- sequences with ambiguous bases (
maxambig) - contigs shorter than a given threshold (
minlength) - contigs longer than a given threshold (
maxlength)
Therefore for B_Sample_98, for running Screen.seqs we want to use the following values:
maxambig= 0minlength= 291maxlength= 341
Our input is the Make.contigs on data X: trim.contigs.fasta, we can use default settings for all other variables and we want to output the logfile.
Optimise files for computation
Microbiome samples typically contain large numbers of the same organism, and therefore we expect to find many identical sequences in our data. In order to speed up computation, we first determine the unique reads, and then record how many times each of these different reads was observed in the original dataset. We do this by using the Unique.seqs tool.
The input for the Unique.seqs is the “good” fasta produced from running Screen.seqs, we want the output file to be Name File and we want to toggle yes to output the logfile.
Here we see that this step has greatly reduced the size of our sequence file; not only will this speed up further computational steps, it will also greatly reduce the amount of disk space (and your Galaxy quota) needed to store all the intermediate files generated during this analysis.
This Unique.seqs tool created two files, one is a FASTA file containing only the unique sequences, and the second is a so-called names file. This names file consists of two columns, the first contains the sequence names for each of the unique sequences, and the second column contains all other sequence names that are identical to the representative sequence in the first column.
To recap, we now have the following files:
- a FASTA file containing every distinct sequence in our dataset (the representative sequences)
- a names file containing the list of duplicate sequences